Check that all listed Python modules are installed (else pip install MODULE).
import kaggle
import sqlalchemy
import pandas as pd
import glob
import os
import matplotlib.pyplot as plt
import seaborn as sns
import country_converter as coco
import plotly.express as px
import wbgapi as wb
from platform import python_version
from random import sample
from collections import Counter
from scipy.stats import zipf
Notebook was created using Python 3.8.5, jupyter-notebook 6.1.4; check current versions:
python_version()
'3.8.5'
!!jupyter --version
['jupyter core : 4.6.3', 'jupyter-notebook : 6.1.4', 'qtconsole : 4.7.7', 'ipython : 7.18.1', 'ipykernel : 5.3.4', 'jupyter client : 6.1.7', 'jupyter lab : not installed', 'nbconvert : 6.0.4', 'ipywidgets : 7.6.3', 'nbformat : 5.0.7', 'traitlets : 5.0.4']
This notebook will download the beyjin dataset CSV files and create a database file in the current directory.
Using SQLite because it is lightweight, serverless and ships with Python, which suits a local, single-user analysis.
If user management, scalability or network access were required, consider MariaDB (an open-source fork of the now-proprietary MySQL) instead.
Interface with database using sqlalchemy module.
Create database file moviedb.sql in current directory.
moviedb = sqlalchemy.create_engine("sqlite:///moviedb.sql")
# check if database interface is working
with moviedb.connect() as con:
    statement = "select 'hello '"
    results = con.execute(statement)
    print(results.fetchall())
[('hello ',)]
Note: kaggle api token (kaggle.json) required in home folder (see https://www.kaggle.com/docs/api).
kaggle.api.authenticate()
# THIS WILL OVERWRITE EXISTING FILES!
kaggle.api.dataset_download_files("beyjin/movies-1990-to-2017", unzip = True)
# shell call to see if files have downloaded and extracted
!!ls
['Beyjin_dataset_analysis.html', 'Beyjin_dataset_analysis.ipynb', 'Movie_Actors.csv', 'Movie_AdditionalRating.csv', 'Movie_Genres.csv', 'Movie_Movies.csv', 'Movie_Writer.csv', 'moviedb.sql']
csv files as pandas DataFrames

def import_files(ext: str) -> dict:
    """Import all files with extension ext from the current dir as pandas DataFrames.

    Args:
        ext: extension of the files to import; files are expected to contain comma-separated values.

    Returns:
        dict of DataFrames: {filename without extension: DataFrame}
    """
    dfs = {}
    for file in glob.glob('*.' + ext):
        dfs[os.path.splitext(file)[0]] = pd.read_csv(file)
    return dfs
dfs = import_files('csv') # DtypeWarning possible
c:\users\user\appdata\local\programs\python\python38-32\lib\site-packages\IPython\core\interactiveshell.py:3337: DtypeWarning: Columns (14) have mixed types.Specify dtype option on import or set low_memory=False. if (await self.run_code(code, result, async_=asy)):
DtypeWarning: [...] have mixed types: this is expected for inconsistent data and will be addressed downstream.
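If the warning needed addressing at import time instead, the offending column could be read with an explicit dtype (or with low_memory=False to disable chunked type inference). A minimal sketch on made-up data — the column content here is illustrative, not taken from the dataset:

```python
import io

import pandas as pd

# a column mixing number-like and plain strings triggers pandas'
# chunked type inference; an explicit dtype sidesteps the guesswork
raw = io.StringIO('imdbVotes\n"106,665"\nunknown\n')
df = pd.read_csv(raw, dtype={"imdbVotes": str})
```

Alternatively, pd.read_csv(raw, low_memory=False) reads the whole file at once before inferring types, which also silences the warning.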
In Jupyter, cells are often executed out of order on the same data, so it makes sense to replace database tables if they already exist. In production, if_exists="replace" should be removed.
for key, df in dfs.items():
    df.to_sql(key, moviedb, if_exists="replace")
Let's check the import using SQL:
with moviedb.connect() as con:
    statement = "select name from sqlite_master where type = 'table' order by 1"
    results = con.execute(statement)
    print(results.fetchall())
[('Movie_Actors',), ('Movie_AdditionalRating',), ('Movie_Genres',), ('Movie_Movies',), ('Movie_Writer',)]
Let's have a look at the Movie_Movies DataFrame:
dfs['Movie_Movies'].shape
(178687, 18)
dfs['Movie_Movies']
| Awards | Country | DVD | Director | Language | Plot | Poster | Production | Rated | Released | Runtime | Title | Type | Website | Year | imdbID | imdbRating | imdbVotes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | USA | NaN | Rose Cummings | English | Rachel constantly hears her baby cry from the ... | NaN | NaN | NaN | 26 Apr 2012 | 20 min | Baby's Breath | movie | NaN | 2012 | tt2268369 | NaN | NaN |
| 1 | NaN | USA | NaN | James Byrne | NaN | The struggle against unfortunate circumstances... | NaN | NaN | NaN | NaN | 9 min | Winter Trees | movie | NaN | 2008 | tt1560760 | NaN | NaN |
| 2 | NaN | USA | NaN | Dimitri Buchowetzki | NaN | NaN | NaN | NaN | NaN | 27 Mar 1926 | 50 min | The Crown of Lies | movie | NaN | 1926 | tt0016750 | NaN | NaN |
| 3 | NaN | USA | NaN | Julia Hechler | English | A Gift introduces Samuel Green, Washington Sta... | NaN | NaN | NaN | 27 May 2013 | 2 min | A Gift | movie | NaN | 2013 | tt3405286 | NaN | NaN |
| 4 | NaN | Sri Lanka | NaN | Udara Siriruwan | Sinhalese | NaN | NaN | NaN | NaN | 20 Mar 2014 | 23 min | Journey | movie | NaN | 2014 | tt3816698 | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 178682 | Nominated for 1 Oscar. Another 7 wins & 13 nom... | Germany, USA | 18 Apr 2000 | Rob Minkoff | English, Dutch | The Little family adopt a charming young mouse... | https://images-na.ssl-images-amazon.com/images... | Columbia Pictures | PG | 17 Dec 1999 | 84 min | Stuart Little | movie | http://www.spe.sony.com/movies/stuartlittle/sp... | 1999 | tt0164912 | 5.9 | 106,665 |
| 178683 | 3 wins & 9 nominations. | USA | 05 Sep 2000 | Oliver Stone | English | A behind-the-scenes look at the life-and-death... | https://ia.media-imdb.com/images/M/MV5BZDk2YTQ... | Warner Bros. Pictures | R | 22 Dec 1999 | 162 min | Any Given Sunday | movie | http://www.anygivensunday.net | 1999 | tt0146838 | 6.8 | 103,302 |
| 178684 | Won 1 Golden Globe. Another 4 wins & 23 nomina... | UK, Germany, Japan, USA | 30 May 2000 | Milos Forman | English | The life and career of a legendary comedian, A... | https://ia.media-imdb.com/images/M/MV5BNDI1Mjc... | Universal | R | 22 Dec 1999 | 118 min | Man on the Moon | movie | http://www.universalpictures.com/manonthemoon | 1999 | tt0125664 | 7.4 | 110,566 |
| 178685 | 7 wins & 14 nominations. | USA | 02 May 2000 | Dean Parisot | English | The alumni cast of a space opera television se... | https://ia.media-imdb.com/images/M/MV5BNmZlNTY... | DreamWorks SKG | PG | 25 Dec 1999 | 102 min | Galaxy Quest | movie | http://www.amazon.com/exec/obidos/subst/video/... | 1999 | tt0177789 | 7.3 | 132,448 |
| 178686 | Nominated for 5 Oscars. Another 10 wins & 75 n... | USA | 27 Jun 2000 | Anthony Minghella | English, Italian | In late 1950s New York, Tom Ripley, a young un... | https://images-na.ssl-images-amazon.com/images... | Paramount Pictures | R | 25 Dec 1999 | 139 min | The Talented Mr. Ripley | movie | http://www.talentedmrripley.com | 1999 | tt0134119 | 7.4 | 151,938 |
178687 rows × 18 columns
First-glance notes:
NaN values appear in all columns, so the data might be sparse. A closer manual look may yield more artifacts, but this issue fits the project scope well, and dealing with it systematically may uncover additional problems.
Rigorous cleaning of the dataset should only be performed after more specific goals have been set and time has been allotted.
More specifically, how much of each column is empty? This will give an overall picture and also identify columns which carry more or less analytical, predictive or prescriptive value.
Let's obtain, for each DataFrame, the frequency of NaN values per column, and store these Series in a dict.
nan_freqs = {}
for key, df in dfs.items():
    nan_freqs[key] = df.isna().sum().map(lambda x: x / df.shape[0])
nan_freqs
{'Movie_Actors': Unnamed: 0 0.0
Actors 0.0
imdbID 0.0
dtype: float64,
'Movie_AdditionalRating': Unnamed: 0 0.0
Rating 0.0
RatingSource 0.0
imdbID 0.0
dtype: float64,
'Movie_Genres': Unnamed: 0 0.0
Genre 0.0
imdbID 0.0
dtype: float64,
'Movie_Movies': Awards 0.886170
Country 0.071012
DVD 0.910743
Director 0.152971
Language 0.142657
Plot 0.454487
Poster 0.729057
Production 0.907962
Rated 0.866213
Released 0.345369
Runtime 0.331250
Title 0.000006
Type 0.000006
Website 0.962241
Year 0.000006
imdbID 0.000006
imdbRating 0.652616
imdbVotes 0.652862
dtype: float64,
'Movie_Writer': Unnamed: 0 0.0
Person 0.0
Responsibility 0.0
imdbID 0.0
dtype: float64}
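As a side note, the loop above can be collapsed: summing and then dividing by the row count is equivalent to isna().mean(), since the mean of a boolean mask is exactly the fraction of True values. A minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"Title": ["A", "B", None, "D"],
                   "Plot": [None, None, "x", None]})

# mean() of the boolean NaN mask == fraction of missing values per column
nan_freq = df.isna().mean()
```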
We see that NaN values are only found in the Movie_Movies dataset - the rest have no empty cells at all.
Let's get a picture of which columns of Movie_Movies carry the least actual information:
# the quickest way of doing this is plotting directly from the Series object
nan_freqs['Movie_Movies'].plot(
    kind='bar',
    figsize=(12, 5),
    rot=45,
    title='NaN fraction per column of Movie_Movies',
);
We see that this part of the data set is fairly inhomogeneous:
- Title, Type and Year are (almost) complete.
- Country, Director and Language are mostly usable.
- Released, Runtime, imdbRating, imdbVotes and Plot are partially usable.
- DVD, Website, Production and Rated are very sparse.
- The Awards column appears problematic: because of the sparseness of the data, we cannot tell whether a row has no Awards entry because the movie received no awards, or because that particular datum is missing. In other words, Awards likely has high sensitivity, but may have low specificity.
Depending on the analysis goals, a decision will need to be made about how to treat these missing values.
Director column
With regard to task 2 (top 10 directors), we see that 27,334 movies have no director specified. Let's have a look at some of them, sampling the list randomly:
movies_no_director = dfs['Movie_Movies'][dfs['Movie_Movies']['Director'].isna()]['Title']
movies_no_director.sample(10)
72724                                   The Deuce
46497                  Neil Diamond: Solitary Man
42750                              Choir Practice
3725                            One Girl Swinging
76687                                Ass Lovers 3
7711                                  All Over It
49215                       Four Fools and a Maid
10237     Lisa Stansfield: Live at Ronnie Scott's
87544                             Dance the World
50045            The Selkirks from Front of Train
Name: Title, dtype: object
Sampling a couple of times, it's probably safe to say that most of these should not be included in the director rating, since they seem to consist of productions like concert recordings and other non-feature material, and therefore form outliers to the intended rating domain.
Before systematically analyzing the data, we should make sure that the data types throughout the DataFrames are consistent.
Ideally, some data should have a specific type (e.g., dates should be of type datetime), so we should convert wherever convenient.
Let's find the type of every entry, for all DataFrames, and then have pandas list the unique types per column, storing the results in a dictionary. Ideally, we'd like one type per column.
unique_datatypes = {}
for key, df in dfs.items():
    unique_datatypes[key] = df.applymap(type).apply(lambda col: col.unique())
unique_datatypes
{'Movie_Actors': Unnamed: 0 Actors imdbID
0 <class 'int'> <class 'str'> <class 'str'>,
'Movie_AdditionalRating': Unnamed: 0 Rating RatingSource imdbID
0 <class 'int'> <class 'str'> <class 'str'> <class 'str'>,
'Movie_Genres': Unnamed: 0 Genre imdbID
0 <class 'int'> <class 'str'> <class 'str'>,
'Movie_Movies': Awards [<class 'float'>, <class 'str'>]
Country [<class 'str'>, <class 'float'>]
DVD [<class 'float'>, <class 'str'>]
Director [<class 'str'>, <class 'float'>]
Language [<class 'str'>, <class 'float'>]
Plot [<class 'str'>, <class 'float'>]
Poster [<class 'float'>, <class 'str'>]
Production [<class 'float'>, <class 'str'>]
Rated [<class 'float'>, <class 'str'>]
Released [<class 'str'>, <class 'float'>]
Runtime [<class 'str'>, <class 'float'>]
Title [<class 'str'>, <class 'float'>]
Type [<class 'str'>, <class 'float'>]
Website [<class 'float'>, <class 'str'>]
Year [<class 'int'>, <class 'float'>, <class 'str'>]
imdbID [<class 'str'>, <class 'float'>]
imdbRating [<class 'float'>]
imdbVotes [<class 'float'>, <class 'str'>]
dtype: object,
'Movie_Writer': Unnamed: 0 Person Responsibility imdbID
0 <class 'int'> <class 'str'> <class 'str'> <class 'str'>}
All DataFrames except Movie_Movies are consistent. In Movie_Movies, however, every column except imdbRating shows multiple types. Year in particular requires attention, with three different types.
It is possible that the float values are only caused by the NaN entries, as NaN in Pandas comes from numpy, where its type is, in fact, float (this can be mitigated by fillna); however, we should still check systematically.
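The NaN-is-float behavior is easy to verify directly; the sketch below also shows pandas' nullable Int64 dtype, which keeps integers intact alongside missing values:

```python
import numpy as np
import pandas as pd

# NaN is itself a float, so an int column is upcast to float once NaN appears
years = pd.Series([1999, np.nan, 2003])

# the nullable integer dtype avoids the upcast
years_nullable = pd.Series([1999, None, 2003], dtype="Int64")
```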
The function defined below, inspect_df_types, generates a long text output, performing a type introspection of Movie_Movies. To improve readability, this output is attached at the end of the notebook, but its discussion follows below the inspect_df_types definition.
def inspect_df_types(df: pd.DataFrame, nan_series: pd.Series):
    '''Print a detailed introspection of df data types and values.

    Args:
        df: pandas DataFrame
        nan_series: pandas Series listing the NaN fraction per column of df
    '''
    # iterate through df columns
    for col in df:
        # per column, identify all value types, and get the unique ones
        col_types = df[col].apply(type).unique()  # array of types
        # also display the data coverage, from the computed NaN frequencies, to gauge usefulness
        col_coverage = '{:.1%}'.format(1 - nan_series[col])
        print(f"{col} ({col_coverage} coverage) \nTYPES IN COLUMN: {col_types}")
        for col_type in col_types:
            print(f"UNIQUE VALUES PER TYPE {col_type}: ")
            # per type in the column, identify all unique values of that type
            # (a bit of overkill, but gives a detailed overview that we can skim)
            unique_type_values = df[col][df[col].apply(isinstance, args=(col_type,))].unique()
            print(unique_type_values)
            print('----------------------------------------')
inspect_df_types output discussion
Given a broader project scope, inspect_df_types should be modified to produce a more succinct, less verbose output; for the current narrower scope it is clear enough to quickly identify a number of key issues by skimming.
Type inconsistencies
- The float entries are the NaN values.
- In the Year column, years are denoted as strings, ints, floats, and even as strings describing intervals, like '2003–2005'. Ideally, we'd cast all of them to datetime, but before we do, we need to address the interval values, and that will depend on the exact analysis. Options include casting intervals to pd.Timedelta, skipping them, using the lower or upper bound (perhaps for long production cycles), or creating one row per year in the interval (perhaps for running series).
- Runtime values are strings that we'd like in some sort of time format, perhaps pd.Timedelta.
- imdbVotes values are integers stored as str, with digit-group separator commas. These should be ints.
- Released and DVD strings should be converted to datetime.
- Rated, Country and Type contain strings, but ideally we'd like to convert these to pandas Categoricals. This should improve performance, both operative and potentially predictive, and makes sense for these columns because the number of labels is relatively low.

Value format inconsistencies

- Runtime values are wildly inconsistent in how they are denoted ('3 h 30 min', '2,002 min', '324 min'); unifying them will likely require parsing via a regex.
- Language and Director can hold one or more comma-separated entries; to-do: check whether other separators are used.
- Rated contains a jumble of different categories, some of which should obviously be unified (like 'Not Rated' and 'Not rated'); some are difficult to discern.

Notes

- Movie_Movies contains both movies and series. Given their different nature, perhaps it should be split into two sets.
- Given the project scope, this initial data quality assessment provides a passable understanding of the data and the key issues. Because data cleanup is not an explicit goal, we can keep the identified issues in mind and proceed to the next task, addressing them as necessary.
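To make the identified conversions concrete, the sketch below tries them out on a few hand-picked example values from the discussion above; the regex-based runtime parser is one possible approach under the formats seen so far, not a vetted implementation:

```python
import re

import pandas as pd

# imdbVotes: strip digit-group commas, then convert; leftovers become <NA>
votes = pd.to_numeric(
    pd.Series(["106,665", "103,302", None]).str.replace(",", "", regex=False),
    errors="coerce",
).astype("Int64")

# Released / DVD: dates follow the '%d %b %Y' pattern seen in the data
released = pd.to_datetime(
    pd.Series(["17 Dec 1999", "25 Dec 1999", None]),
    format="%d %b %Y", errors="coerce",
)

# Rated / Country / Type: low-cardinality strings as Categoricals
rated = pd.Series(["R", "PG", "Not Rated", "R"]).astype("category")

def runtime_to_minutes(value) -> float:
    """Parse runtime strings like '84 min', '2 h 5 min', '2 h' or '2,002 min'."""
    if not isinstance(value, str):
        return float("nan")
    hours = re.search(r"(\d+)\s*h", value)
    minutes = re.search(r"([\d,]+)\s*min", value)
    if not (hours or minutes):
        return float("nan")
    total = int(hours.group(1)) * 60 if hours else 0
    if minutes:
        total += int(minutes.group(1).replace(",", ""))
    return float(total)
```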
We now have a sufficient overview to review project goals and to discuss what value can and cannot be extracted from the data set, and what kind of resource investment it warrants.
The task is to obtain a list of the 10 directors who produced the most movies, in descending order.
Let's first randomly sample from Director a couple of times, to see what kind of values the column contains.
dfs['Movie_Movies']['Director'].sample(10)
18724     Wallace A. Carlson
44581             Amy Schatz
131636                   NaN
112742         Harris McCabe
78167         Clara van Gool
103560        Sheryl Matthys
96044                    NaN
58260       William C. Judge
175349            Bob Dolman
175112        Jordan Roberts
Name: Director, dtype: object
So movies can be directed by zero, one or many people. This makes sense in terms of the dataset, so there's no need to modify it, but let's make some assumptions with respect to the task:
A movie with several comma-separated directors counts once for each of them, and NaN is not considered a director (groupby drops NaN keys automatically).
director_movies_count = Counter()
for directors, director_df in dfs['Movie_Movies'].groupby(by='Director'):  # movies per "director" string
    for director in directors.split(','):  # to-do: see if other separators are used
        director = director.strip()
        director_movies_count[director] += len(director_df)
director_movies_count.most_common(10)
[('Jim Powers', 157),
('Dave Fleischer', 109),
('D.W. Griffith', 104),
('Lewin Fitzhamon', 95),
('Al Christie', 90),
('Georges Méliès', 88),
('Kevin Dunn', 85),
("Gilbert M. 'Broncho Billy' Anderson", 85),
('Quasarman', 85),
('James H. White', 83)]
[name for (name, num) in director_movies_count.most_common(10)]
['Jim Powers', 'Dave Fleischer', 'D.W. Griffith', 'Lewin Fitzhamon', 'Al Christie', 'Georges Méliès', 'Kevin Dunn', "Gilbert M. 'Broncho Billy' Anderson", 'Quasarman', 'James H. White']
Therefore, the list specified in Task 2 is:
['Jim Powers', 'Dave Fleischer', 'D.W. Griffith', 'Lewin Fitzhamon', 'Al Christie', 'Georges Méliès', 'Kevin Dunn', "Gilbert M. 'Broncho Billy' Anderson", 'Quasarman', 'James H. White']
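As a cross-check, the same ranking can be produced without the explicit Counter loop, using pandas' str.split and explode; a minimal sketch on toy data (the director names are made up):

```python
import pandas as pd

movies = pd.DataFrame(
    {"Director": ["A. Example", "A. Example, B. Sample", None, "B. Sample"]}
)
director_counts = (
    movies["Director"]
    .dropna()              # NaN is not a director
    .str.split(",")
    .explode()             # one row per individual director
    .str.strip()
    .value_counts()        # sorted descending by movie count
)
```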
We can repeat the approach we took for the per-director movie count to see which countries produce the most movies, and visualize this via an interactive choropleth graph.
THE CHOROPLETH GRAPHS WILL ONLY SHOW UP IF THIS NOTEBOOK IS RUN LOCALLY AS JUPYTER NOTEBOOK, OR VIEWED AS EXPORTED HTML - IT SEEMS THAT GITHUB AND NBVIEWER CANNOT RENDER THE INTERACTIVE CHOROPLETH GRAPHS.
countries_movies_count = Counter()
for countries, country_df in dfs['Movie_Movies'].groupby(by='Country'):
    for country in countries.split(','):
        country = country.strip()
        countries_movies_count[country] += len(country_df)
countries_movies_count_df = pd.DataFrame({'Country': countries_movies_count.keys(), 'Movie Count': countries_movies_count.values()})
countries_movies_count_df = countries_movies_count_df.sort_values(by='Movie Count',ascending = False).reset_index(drop=True)
We immediately see that most movies are produced in the US; let's quickly plot the top 15 countries:
countries_movies_count_df.iloc[:15].plot(
    kind='bar',
    x='Country',
    rot=45,
    figsize=(12, 5),
);
This looks like a case of Zipf's law - a country's number of productions is inversely proportional to its rank in the movie count table.
To create the choropleth graph, we need the country names as ISO 3-letter codes, so let's derive them from the country names. Note that not all Country entries can be converted: some movies were made in countries that no longer exist as such; how to deal with that is a matter of debate, but for now we'll just skip them (with one notable exception: we'll manually change "UK" to "United Kingdom").
countries_movies_count_df['Country'].replace(to_replace="UK", value ="United Kingdom", inplace=True)
countries_movies_count_df['ISO3'] = countries_movies_count_df['Country'].apply(coco.convert)
West Germany not found in regex Soviet Union not found in regex Yugoslavia not found in regex East Germany not found in regex Federal Republic of Yugoslavia not found in regex Serbia and Montenegro not found in regex Netherlands Antilles not found in regex North Vietnam not found in regex
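If we wanted to keep some of those rows instead of skipping them, the defunct countries could be remapped to successor states before conversion. Which successor to choose is a modeling judgment, so the mapping below is purely an illustrative assumption:

```python
import pandas as pd

# illustrative successor-state mapping (a modeling assumption, not a standard)
successor = {
    "West Germany": "Germany",
    "East Germany": "Germany",
    "Soviet Union": "Russia",
    "Yugoslavia": "Serbia",
}
countries = pd.Series(["West Germany", "France", "Soviet Union"])
remapped = countries.replace(successor)
```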
countries_movies_count_df
| Country | Movie Count | ISO3 | |
|---|---|---|---|
| 0 | USA | 99211 | USA |
| 1 | United Kingdom | 18740 | GBR |
| 2 | Canada | 8714 | CAN |
| 3 | France | 6626 | FRA |
| 4 | Australia | 3998 | AUS |
| ... | ... | ... | ... |
| 225 | Holy See (Vatican City State) | 1 | VAT |
| 226 | Svalbard And Jan Mayen | 1 | SJM |
| 227 | Saint Lucia | 1 | LCA |
| 228 | Cook Islands | 1 | COK |
| 229 | French Guiana | 1 | GUF |
230 rows × 3 columns
px.choropleth(countries_movies_count_df,
locations='ISO3',
color = 'Movie Count',
height=600)
Because the movie counts drop off so steeply, the visualization only really highlights three countries - which accurately reflects the absolute numbers, but offers little additional insight.
Just for kicks, let's transform the absolute count by dividing it by the country's GDP in 2015, standardize the result, and then use a choropleth to show movie output as a function of recent economic output.
We can obtain country GDP values by querying the World Bank API, and then inner-joining the response DataFrame with our country_movies_count_df.
# Use World Bank API
gdp_df = wb.data.DataFrame('NY.GDP.MKTP.CD',labels=True)
gdp_df
| Country | YR1960 | YR1961 | YR1962 | YR1963 | YR1964 | YR1965 | YR1966 | YR1967 | YR1968 | ... | YR2011 | YR2012 | YR2013 | YR2014 | YR2015 | YR2016 | YR2017 | YR2018 | YR2019 | YR2020 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| economy | |||||||||||||||||||||
| ZWE | Zimbabwe | 1.052990e+09 | 1.096647e+09 | 1.117602e+09 | 1.159512e+09 | 1.217138e+09 | 1.311436e+09 | 1.281750e+09 | 1.397002e+09 | 1.479600e+09 | ... | 1.410192e+10 | 1.711485e+10 | 1.909102e+10 | 1.949552e+10 | 1.996312e+10 | 2.054868e+10 | 2.204090e+10 | 2.431156e+10 | 2.144076e+10 | NaN |
| ZMB | Zambia | 7.130000e+08 | 6.962857e+08 | 6.931429e+08 | 7.187143e+08 | 8.394286e+08 | 1.082857e+09 | 1.264286e+09 | 1.368000e+09 | 1.605857e+09 | ... | 2.345952e+10 | 2.550306e+10 | 2.804555e+10 | 2.715073e+10 | 2.124334e+10 | 2.095475e+10 | 2.586817e+10 | 2.631214e+10 | 2.330977e+10 | NaN |
| YEM | Yemen, Rep. | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 3.272642e+10 | 3.540134e+10 | 4.041524e+10 | 4.320647e+10 | 4.245062e+10 | 3.093598e+10 | 2.673614e+10 | 2.348627e+10 | 2.258108e+10 | NaN |
| PSE | West Bank and Gaza | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.118610e+10 | 1.220840e+10 | 1.351550e+10 | 1.398970e+10 | 1.397240e+10 | 1.540540e+10 | 1.612800e+10 | 1.627660e+10 | NaN | NaN |
| VIR | Virgin Islands (U.S.) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 4.239000e+09 | 4.095000e+09 | 3.762000e+09 | 3.622000e+09 | 3.748000e+09 | 3.863000e+09 | 3.855000e+09 | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| EAS | East Asia & Pacific | 1.532581e+11 | 1.537058e+11 | 1.572680e+11 | 1.754950e+11 | 2.016310e+11 | 2.245756e+11 | 2.510803e+11 | 2.720544e+11 | 2.997578e+11 | ... | 1.968738e+13 | 2.107208e+13 | 2.131856e+13 | 2.200218e+13 | 2.191164e+13 | 2.265859e+13 | 2.422269e+13 | 2.635135e+13 | 2.692466e+13 | NaN |
| EAR | Early-demographic dividend | 1.517106e+11 | 1.528336e+11 | 1.586219e+11 | 1.681859e+11 | 1.930554e+11 | 2.093716e+11 | 2.064487e+11 | 2.169385e+11 | 2.348353e+11 | ... | 9.613821e+12 | 1.008164e+13 | 1.025981e+13 | 1.062047e+13 | 1.031164e+13 | 1.055044e+13 | 1.139972e+13 | 1.164598e+13 | 1.198987e+13 | NaN |
| CEB | Central Europe and the Baltics | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.449107e+12 | 1.352689e+12 | 1.422194e+12 | 1.464900e+12 | 1.292087e+12 | 1.320631e+12 | 1.463498e+12 | 1.644436e+12 | 1.668852e+12 | NaN |
| CSS | Caribbean small states | 1.880306e+09 | 2.038302e+09 | 2.153896e+09 | 2.290314e+09 | 2.470265e+09 | 2.660946e+09 | 2.888648e+09 | 3.102515e+09 | 3.083591e+09 | ... | 6.966827e+10 | 7.205112e+10 | 7.341759e+10 | 7.443367e+10 | 7.282847e+10 | 6.919134e+10 | 7.145256e+10 | 7.496727e+10 | 7.772171e+10 | NaN |
| ARB | Arab World | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.589161e+10 | ... | 2.494029e+12 | 2.773499e+12 | 2.856372e+12 | 2.894391e+12 | 2.534724e+12 | 2.469440e+12 | 2.516121e+12 | 2.771384e+12 | 2.817415e+12 | NaN |
264 rows × 62 columns
countries_movies_count_df
| Country | Movie Count | ISO3 | |
|---|---|---|---|
| 0 | USA | 99211 | USA |
| 1 | United Kingdom | 18740 | GBR |
| 2 | Canada | 8714 | CAN |
| 3 | France | 6626 | FRA |
| 4 | Australia | 3998 | AUS |
| ... | ... | ... | ... |
| 225 | Holy See (Vatican City State) | 1 | VAT |
| 226 | Svalbard And Jan Mayen | 1 | SJM |
| 227 | Saint Lucia | 1 | LCA |
| 228 | Cook Islands | 1 | COK |
| 229 | French Guiana | 1 | GUF |
230 rows × 3 columns
gdp2015 = gdp_df['YR2015'].reset_index()
gdp2015
| economy | YR2015 | |
|---|---|---|
| 0 | ZWE | 1.996312e+10 |
| 1 | ZMB | 2.124334e+10 |
| 2 | YEM | 4.245062e+10 |
| 3 | PSE | 1.397240e+10 |
| 4 | VIR | 3.748000e+09 |
| ... | ... | ... |
| 259 | EAS | 2.191164e+13 |
| 260 | EAR | 1.031164e+13 |
| 261 | CEB | 1.292087e+12 |
| 262 | CSS | 7.282847e+10 |
| 263 | ARB | 2.534724e+12 |
264 rows × 2 columns
# do an inner join based on the country codes
movie_gdp_df = pd.merge(countries_movies_count_df, gdp2015, how="inner", left_on="ISO3", right_on="economy")
movie_gdp_df
| Country | Movie Count | ISO3 | economy | YR2015 | |
|---|---|---|---|---|---|
| 0 | USA | 99211 | USA | USA | 1.822470e+13 |
| 1 | United Kingdom | 18740 | GBR | GBR | 2.928591e+12 |
| 2 | Canada | 8714 | CAN | CAN | 1.556130e+12 |
| 3 | France | 6626 | FRA | FRA | 2.438208e+12 |
| 4 | Australia | 3998 | AUS | AUS | 1.351694e+12 |
| ... | ... | ... | ... | ... | ... |
| 205 | Tonga | 1 | TON | TON | 4.370062e+08 |
| 206 | Faroe Islands | 1 | FRO | FRO | 2.518096e+09 |
| 207 | Sao Tome And Principe | 1 | STP | STP | 3.160654e+08 |
| 208 | Turkmenistan | 1 | TKM | TKM | 3.579971e+10 |
| 209 | Saint Lucia | 1 | LCA | LCA | 1.808080e+09 |
210 rows × 5 columns
movie_gdp_df['Movie Count / GDP (2015)'] = movie_gdp_df['Movie Count'] / movie_gdp_df['YR2015']
Standard score (z-score): $$z = \frac{x - \mu}{\sigma}$$
# standard score (z-score)
movie_gdp_df['normalized movie count per GDP'] = (
    movie_gdp_df['Movie Count / GDP (2015)']
    - movie_gdp_df['Movie Count / GDP (2015)'].mean()
) / movie_gdp_df['Movie Count / GDP (2015)'].std()
# identify outliers; trying a few thresholds shows 2 to be a good cutoff
movie_gdp_df[movie_gdp_df['normalized movie count per GDP']>2]
| Country | Movie Count | ISO3 | economy | YR2015 | Movie Count / GDP (2015) | normalized movie count per GDP | |
|---|---|---|---|---|---|---|---|
| 114 | San Marino | 21 | SMR | SMR | 1.419395e+09 | 1.479504e-08 | 4.659508 |
| 183 | Marshall Islands | 3 | MHL | MHL | 1.845996e+08 | 1.625139e-08 | 5.185327 |
| 184 | Micronesia | 3 | FSM | FSM | 3.164899e+08 | 9.478975e-09 | 2.740131 |
| 196 | Tuvalu | 1 | TUV | TUV | 3.549207e+07 | 2.817530e-08 | 9.490484 |
px.choropleth(movie_gdp_df.drop(movie_gdp_df[movie_gdp_df['normalized movie count per GDP']>2].index),
locations='ISO3',
color = 'normalized movie count per GDP',
title = 'Number of movies as a function of recent GDP',
height=600)
This map shows the number of movies as a function of recent GDP, normalized and with outliers removed. So far, this is nothing more than a thought experiment, but this could be extended to a sophisticated analysis, linking time-specific GDP to time-specific movie production, and including population in the ranking.
Depending on the analysis goals, we might, for example, be interested in the distributions of the data, to get a better feel for the data set.
Looking further down the analytical / predictive pipeline, estimating the distribution functions would be important for Bayesian machine learning approaches.
Continuing where we left off in Task 2, let's plot the number of movies per director.
dir_mov_count_df = pd.DataFrame(
    director_movies_count.most_common(),
    columns=['Director', 'Number of productions'],
)
Here, because we have almost 100k entries, a bar plot makes little sense, and using a line plot is better.
dir_mov_count_df.plot(
    y='Number of productions',
    title="Number of produced movies per director index",
    xlabel="Director index in rank table",
    figsize=(12, 5),
);
This also looks like a case of Zipf's law - a director's number of productions is inversely proportional to their rank in the movie count table. Let's see how the log-log plot looks:
dir_mov_count_df.plot(
    y='Number of productions',
    title="Log-log plot of the number of produced movies per director index",
    figsize=(12, 5),
    loglog=True,
);
This looks a lot like Zipf's distribution. If we wanted to investigate further, we could attempt to estimate $\alpha$ using scipy.optimize.curve_fit and see how well the distributions correlate.
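Since a power law is a straight line in log-log space, the exponent can also be estimated with a simple linear fit (np.polyfit here, though scipy.optimize.curve_fit works too); a minimal sketch on synthetic data:

```python
import numpy as np

# synthetic Zipf-like counts: count = C * rank**(-alpha), with alpha = 1.2
rank = np.arange(1, 501)
counts = 1000.0 * rank ** (-1.2)

# log(count) = log(C) - alpha * log(rank): a straight line with slope -alpha
slope, intercept = np.polyfit(np.log(rank), np.log(counts), 1)
alpha_est = -slope
```

On real, noisy data the fit quality (e.g. residuals in log space) would indicate how well Zipf's law actually holds.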
Let's have a look at how imdbRating is distributed.
sns.displot(
    height=5,
    aspect=2,
    data=dfs['Movie_Movies'],
    x='imdbRating',
    kde=True,
    binwidth=.1,
);
Let's also gather some basic statistics:
dfs['Movie_Movies']['imdbRating'].describe()
count    62073.000000
mean         6.416236
std          1.360472
min          1.000000
25%          5.600000
50%          6.500000
75%          7.300000
max         10.000000
Name: imdbRating, dtype: float64
We see that the mean rating is 6.42, on a scale from 1 to 10 (so a neutral rating would be 5.5), which raises some questions.
We could investigate further by, for instance, looking at how votes are distributed over time after a movie's release, including user reviews, introducing weighting functions, and referencing additional sources.
Regarding the distribution of the ratings, hinted at here by the line (a kernel density estimate): it is not immediately clear what the underlying distribution function is.
A quick web search reveals that this may be a Lévy skew distribution, but this would require further investigation.
inspect_df_types(dfs['Movie_Movies'], nan_freqs['Movie_Movies'])
Awards (11.4% coverage)
TYPES IN COLUMN: [<class 'float'> <class 'str'>]
UNIQUE VALUES PER TYPE <class 'float'>: [nan]
UNIQUE VALUES PER TYPE <class 'str'>: ['1 win.' '1 nomination.' '4 wins & 5 nominations.' ... 'Nominated for 1 Oscar. Another 7 wins & 13 nominations.' 'Won 1 Golden Globe. Another 4 wins & 23 nominations.' 'Nominated for 5 Oscars. Another 10 wins & 75 nominations.']
----------------------------------------
Country (92.9% coverage)
TYPES IN COLUMN: [<class 'str'> <class 'float'>]
UNIQUE VALUES PER TYPE <class 'str'>: ['USA' 'Sri Lanka' 'UK' ... 'France, Germany, USA, Japan' 'France, Hungary, USA' 'UK, Germany, Japan, USA']
UNIQUE VALUES PER TYPE <class 'float'>: [nan]
----------------------------------------
DVD (8.9% coverage)
TYPES IN COLUMN: [<class 'float'> <class 'str'>]
UNIQUE VALUES PER TYPE <class 'float'>: [nan]
UNIQUE VALUES PER TYPE <class 'str'>: ['03 Feb 2014' '23 Nov 2004' '31 Jul 2001' ... '26 Oct 2006' '29 Jul 1999' '30 Jun 2000']
----------------------------------------
Director (84.7% coverage)
TYPES IN COLUMN: [<class 'str'> <class 'float'>]
UNIQUE VALUES PER TYPE <class 'str'>: ['Rose Cummings' 'James Byrne' 'Dimitri Buchowetzki' ... 'Michael Corrente' 'Hampton Fancher' 'Robert Marcarelli']
UNIQUE VALUES PER TYPE <class 'float'>: [nan]
----------------------------------------
Language (85.7% coverage)
TYPES IN COLUMN: [<class 'str'> <class 'float'>]
UNIQUE VALUES PER TYPE <class 'str'>: ['English' 'Sinhalese' 'Korean' ... 'English, Latin, Swedish, Norse, Old, Danish' 'English, Assyrian Neo-Aramaic, Portuguese, Italian, Aramaic' 'English, Japanese, Arabic, Persian']
UNIQUE VALUES PER TYPE <class 'float'>: [nan]
----------------------------------------
Plot (54.6% coverage)
TYPES IN COLUMN: [<class 'str'> <class 'float'>]
UNIQUE VALUES PER TYPE <class 'str'>: ["Rachel constantly hears her baby cry from the bedroom, but David, her husband, won't let her go to the baby. There's something in there he doesn't want her to see, to know. As the cries ..." 'The struggle against unfortunate circumstances unfolds in two intimate portraits - a young woman reclaims herself and her life after injury; and an older woman battles to hang on as dementia invades her life.' "A Gift introduces Samuel Green, Washington State's first poet laureate, whose poetry inspired The Only Time We Have. Samuel uses poetry to make sense of his life experiences as well as to better understand others." ... 'The life and career of a legendary comedian, Andy Kaufman.' 'The alumni cast of a space opera television series have to play their roles as the real thing when an alien race needs their help. However, they also have to defend both Earth and the alien race from a reptilian warlord.' 'In late 1950s New York, Tom Ripley, a young underachiever, is sent to Italy to retrieve Dickie Greenleaf, a rich and spoiled millionaire playboy. But when the errand fails, Ripley takes extreme measures.']
UNIQUE VALUES PER TYPE <class 'float'>: [nan]
----------------------------------------
Poster (27.1% coverage)
TYPES IN COLUMN: [<class 'float'> <class 'str'>]
UNIQUE VALUES PER TYPE <class 'float'>: [nan]
UNIQUE VALUES PER TYPE <class 'str'>: ['https://images-na.ssl-images-amazon.com/images/M/MV5BYmZmMzUxOTMtNDljOS00YjJmLThiNzYtMWU2YTkzZTZkMjZjXkEyXkFqcGdeQXVyNjc4NjAwNjY@._V1_SX300.jpg' 'https://images-na.ssl-images-amazon.com/images/M/MV5BNTk0YzY5M2EtMGMyMi00M2E3LTk0ZTAtMjI1NmVhNGFjYmI2L2ltYWdlXkEyXkFqcGdeQXVyMjExMTI5MzU@._V1_SX300.jpg' 'https://images-na.ssl-images-amazon.com/images/M/MV5BMTgxNjk4ODk5MF5BMl5BanBnXkFtZTgwNjc2NDU4MDE@._V1_SX300.jpg' ... 'https://ia.media-imdb.com/images/M/MV5BNDI1Mjc3MzAtZDk0OS00OTZlLTlhZjktNzA3ODgwZGY2NWIwXkEyXkFqcGdeQXVyMTQxNzMzNDI@._V1_SX300.jpg' 'https://ia.media-imdb.com/images/M/MV5BNmZlNTY5YjQtZTU5ZC00MzcyLWI1NDMtNjA0ZjQxMmQwYjJmL2ltYWdlXkEyXkFqcGdeQXVyNTAyODkwOQ@@._V1_SX300.jpg' 'https://images-na.ssl-images-amazon.com/images/M/MV5BODA3NDhiZjYtYTk2NS00ZWYwLTljYTQtMjU0NzcyMGEzNTU2L2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyMTQxNzMzNDI@._V1_SX300.jpg']
----------------------------------------
Production (9.2% coverage)
TYPES IN COLUMN: [<class 'float'> <class 'str'>]
UNIQUE VALUES PER TYPE <class 'float'>: [nan]
UNIQUE VALUES PER TYPE <class 'str'>: ['Khoury A Marriot Productions' 'Columbia Pictures' 'Sony Pictures Home Entertainment' ... 'New Line Cinema.' 'Providence Entertainment' 'Happy Madison Productions']
----------------------------------------
Rated (13.4% coverage)
TYPES IN COLUMN: [<class 'float'> <class 'str'>]
UNIQUE VALUES PER TYPE <class 'float'>: [nan]
UNIQUE VALUES PER TYPE <class 'str'>: ['AL' 'G' 'Approved' 'APPROVED' 'NOT RATED' 'R' 'X' 'PASSED' 'UNRATED' 'TV-PG' 'Not Rated' 'PG' 'PG-13' 'Refused Classification' 'TV-14' 'TV-G' 'Unrated' '18A' 'TV-MA' 'R18' 'GP' 'U' 'Passed' 'TV-Y' 'KT' 'M' 'NR' 'Btl' 'NC-17' 'AO' '18' 'E' 'E10+' 'o.Al.' '12' 'TV-Y7' '6' 'R-18' '(Banned)' '14A' 'Atp' 'K-3' 'Livre' '16' 'K-16' '15' 'A' 'M/12' 'K-15' 'OPEN' '-12' '18+' '7' 'T' 'L' 'S' '13+' 'VM18' 'B' 'K-7' 'M/PG' '(BANNED)' '14' 'K-18' 'K-12' 'III' 'TV-13' 'AA' '13' 'K-11' 'M/6' 'R18+' '11' '12A' 'U/A' 'Not rated' 'MA' '14+' 'IIA' 'M18']
----------------------------------------
Released (65.5% coverage)
TYPES IN COLUMN: [<class 'str'> <class 'float'>]
UNIQUE VALUES PER TYPE <class 'str'>: ['26 Apr 2012' '27 Mar 1926' '27 May 2013' ... '09 Jul 1999' '04 Aug 1999' '27 Aug 1999']
UNIQUE VALUES PER TYPE <class 'float'>: [nan]
----------------------------------------
Runtime (66.9% coverage)
TYPES IN COLUMN: [<class 'str'> <class 'float'>]
UNIQUE VALUES PER TYPE <class 'str'>: ['20 min' '9 min' '50 min' '2 min' '23 min' '42 min' '12 min' '10 min' '18 min' '7 min' '97 min' '2 h 5 min' '93 min' '27 min' '75 min' '85 min' '3 min' '13 min' '70 min' '72 min' '6 min' '26 min' '82 min' '15 min' '24 min' '5 min' '25 min' '124 min' '46 min' '55 min' '81 min' '56 min' '87 min' '73 min' '126 min' '120 min' '90 min' '8 min' '76 min' '110 min' '30 min' '4 min' '96 min' '29 min' '60 min' '86 min' '59 min' '102 min' '108 min' '19 min' '2 h' '54 min' '118 min' '94 min' '150 min' '21 min' '14 min' '1 min' '45 min' '89 min' '22 min' '38 min' '78 min' '33 min' '95 min' '51 min' '183 min' '16 min' '92 min' '17 min' '65 min' '105 min' '52 min' '121 min' '135 min' '140 min' '11 min' '48 min' '83 min' '100 min' '180 min' '34 min' '58 min' '35 min' '43 min' '101 min' '136 min' '123 min' '103 min' '177 min' '79 min' '112 min' '28 min' '64 min' '98 min' '137 min' '91 min' '74 min' '77 min' '106 min' '114 min' '176 min' '63 min' '49 min' '80 min' '44 min' '148 min' '40 min' '69 min' '37 min' '62 min' '119 min' '217 min' '225 min' '145 min' '84 min' '71 min' '104 min' '31 min' '115 min' '107 min' '88 min' '153 min' '61 min' '47 min' '57 min' '125 min' '152 min' '168 min' '1 h 25 min' '1 h 30 min' '67 min' '132 min' '131 min' '1 h 10 min' '129 min' '1 h 7 min' '32 min' '1 h 50 min' '127 min' '1 h 4 min' '68 min' '134 min' '99 min' '175 min' '2 h 11 min' '167 min' '53 min' '117 min' '1 h 35 min' '138 min' '240 min' '1 h' '1 h 59 min' '1 h 1 min' '39 min' '169 min' '122 min' '109 min' '1 h 43 min' '3 h 44 min' '116 min' '2,001 min' '166 min' '130 min' '111 min' '158 min' '66 min' '1 h 20 min' '113 min' '1 h 45 min' '1 h 29 min' '141 min' '1 h 11 min' '3 h' '187 min' '1 h 28 min' '1 h 48 min' '149 min' '144 min' '1 h 23 min' '2 h 48 min' '139 min' '155 min' '189 min' '245 min' '1 h 15 min' '231 min' '173 min' '285 min' '960 min' '237 min' '186 min' '2 h 7 min' '208 min' '128 min' '157 min' '143 min' '36 min' '41 min' '1 h 31 min' '156 min' '241 min' '1 h 40 min' '1 h 9 min' '164 min' '2 h 27 min' '1 h 42 min' '300 min' '1 h 6 min' '160 min' '182 min' '4 h' '260 min' '252 min' '1 h 16 min' '188 min' '163 min' '2 h 10 min' '133 min' '1 h 5 min' '159 min' '600 min' '1 h 12 min' '1 h 26 min' '1 h 19 min' '333 min' '400 min' '142 min' '2 h 17 min' '178 min' '181 min' '220 min' '724 min' '317 min' '185 min' '232 min' '1 h 36 min' '540 min' '146 min' '1 h 27 min' '151 min' '1 h 17 min' '210 min' '1 h 14 min' '1 h 46 min' '162 min' '1 h 13 min' '147 min' '1,320 min' '165 min' '200 min' '323 min' '250 min' '1 h 47 min' '275 min' '359 min' '227 min' '172 min' '296 min' '249 min' '213 min' '1 h 8 min' '236 min' '3,608 min' '195 min' '270 min' '224 min' '2 h 50 min' '2 h 30 min' '1 h 56 min' '1 h 44 min' '170 min' '1 h 39 min' '3 h 2 min' '1 h 54 min' '1 h 33 min' '341 min' '283 min' '1 h 24 min' '179 min' '1 h 18 min' '360 min' '334 min' '1 h 51 min' '390 min' '3 h 55 min' '174 min' '192 min' '1 h 22 min' '239 min' '194 min' '1 h 38 min' '209 min' '2 h 56 min' '5 h 30 min' '274 min' '2 h 32 min' '1 h 3 min' '199 min' '2 h 19 min' '1 h 2 min' '154 min' '222 min' '3 h 40 min' '1 h 58 min' '3 h 57 min' '2 h 58 min' '230 min' '193 min' '191 min' '315 min' '1 h 37 min' '1 h 55 min' '353 min' '5 h' '330 min' '4 h 16 min' '338 min' '2 h 18 min' '320 min' '6 h 12 min' '495 min' '380 min' '256 min' '312 min' '267 min' '206 min' '1 h 21 min' '289 min' '235 min' '207 min' '2 h 15 min' '273 min' '2 h 33 min' '2 h 9 min' '219 min' '2 h 2 min' '3 h 5 min' '2 h 24 min' '414 min' '1 h 34 min' '228 min' '2 h 14 min' '1 h 52 min' '1,260 min' '1 h 41 min' '161 min' '281 min' '2 h 49 min' '1 h 32 min' '3 h 7 min' '244 min' '2 h 20 min' '3 h 13 min' '480 min' '204 min' '3 h 4 min' '286 min' '251 min' '247 min' '196 min' '205 min' '218 min' '780 min' '184 min' '313 min' '212 min' '2 h 37 min' '4 h 38 min' '473 min' '3 h 25 min' '2 h 34 min' '3 h 12 min' '469 min' '1,080 min' '4 h 49 min' '1.5 min' '14,400 min' '2 h 44 min' '255 min' '720 min' '190 min' '536 min' '202 min' '15 h 40 mi' '261 min' '2,000 min' '215 min' '216 min' '4 h 23 min' '551 min' '6 h' '211 min' '370 min' '319 min' '1,000 min' '570 min' '2 h 4 min' '234 min' '5 h 18 min' '171 min' '440 min' '310 min' '3 h 15 min' '223 min' '355 min' '3 h 19 min' '460 min' '1 h 49 min' '14 h' '1 h 53 min' '33 h 20 mi' '2 h 13 min' '2 h 3 min' '363 min' '259 min' '201 min' '356 min' '493 min' '221 min' '233 min' '304 min' '278 min' '321 min' '1,440 min' '964 min' '302 min' '265 min' '420 min' '602 min' '1,669 min' '305 min' '350 min' '2 h 47 min' '1 h 57 min' '284 min' '2 h 42 min' '318 min' '197 min' '337 min' '254 min' '257 min' '865 min' '203 min' '2 h 1 min' '516 min' '282 min' '263 min' '2 h 29 min' '258 min' '2 h 23 min' '376 min' '246 min' '226 min' '2 h 26 min' '2 h 16 min' '1,500 min' '298 min' '501 min' '325 min' '747 min' '579 min' '450 min' '500 min' '3 h 17 min' '340 min' '1,559 min' '404 min' '2 h 43 min' '328 min' '3 h 26 min' '431 min' '416 min' '280 min' '4 h 17 min' '268 min' '214 min' '3 h 10 min' '290 min' '198 min' '4 h 18 min' '478 min' '5,220 min' '3 h 52 min' '326 min' '730 min' '375 min' '383 min' '2 h 21 min' '630 min' '398 min' '248 min' '288 min' '294 min' '7 h 28 min' '364 min' '2,880 min' '366 min' '336 min' '327 min' '715 min' '348 min' '4 h 5 min' '3 h 30 min' '2,002 min' '324 min' '303 min' '2 h 59 min' '2 h 12 min' '345 min' '357 min' '1,411 min' '3 h 14 min' '388 min' '262 min' '381 min' '4 h 40 min' '238 min' '3 h 20 min' '266 min' '243 min' '368 min' '229 min' '2 h 6 min' '38 h 43 mi' '485 min' '422 min' '2 h 51 min' '308 min' '545 min' '272 min' '307 min' '352 min' '271 min']
UNIQUE VALUES PER TYPE <class 'float'>: [nan]
----------------------------------------
Title (100.0% coverage)
TYPES IN COLUMN: [<class 'str'> <class 'float'>]
UNIQUE VALUES PER TYPE <class 'str'>: ["Baby's Breath" 'Winter Trees' 'The Crown of Lies' ... 'Any Given Sunday' 'Galaxy Quest' 'The Talented Mr. Ripley']
UNIQUE VALUES PER TYPE <class 'float'>: [nan]
----------------------------------------
Type (100.0% coverage)
TYPES IN COLUMN: [<class 'str'> <class 'float'>]
UNIQUE VALUES PER TYPE <class 'str'>: ['movie' 'series']
UNIQUE VALUES PER TYPE <class 'float'>: [nan]
----------------------------------------
Website (3.8% coverage)
TYPES IN COLUMN: [<class 'float'> <class 'str'>]
UNIQUE VALUES PER TYPE <class 'float'>: [nan]
UNIQUE VALUES PER TYPE <class 'str'>: ['http://www.facebook.com/TwoNightStandMovie' 'http://www.facebook.com/FullmetalAlchemistMovie2' 'http://www.fourthemovie.com/' ... 'http://www.universalpictures.com/manonthemoon' 'http://www.amazon.com/exec/obidos/subst/video/misc/dreamworks/galaxy-quest/gq-home.html/103-6831095-1845408' 'http://www.talentedmrripley.com']
----------------------------------------
Year (100.0% coverage)
TYPES IN COLUMN: [<class 'int'> <class 'float'> <class 'str'>]
UNIQUE VALUES PER TYPE <class 'int'>: [2012 2008 1926 2013 2014 2009 1915 1966 2005 2016 2017 2002 2015 2004 2011 1986 2001 1953 1990 1970 1934 1997 1908 2010 1931 2006 1971 2007 1989 1950 1919 1973 1902 1959 1964 1916 1995 1914 1939 1991 1912 1918 1975 1967 1957 1994 1996 1992 1913 1910 1993 1969 1907 1911 1999 1987 1972 1937 1958 1925 1928 1983 2003 1984 1998 2000 2018 1956 1988 1982 1955 1932 1952 1906 1960 1933 1927 1917 1974 1901 1985 1980 1977 1920 1899 1962 1981 1979 1921 1936 1946 1930 1905 1965 1923 1942 1961 1903 1898 1943 1978 1909 1948 1924 1929 1976 1949 1947 1922 1940 1938 1954 1941 1900 1935 1944 1968 1904 1951 1895 1963 2019 1945 1896 1897 2020 1889 1894 2021 1890 2023 2022]
UNIQUE VALUES PER TYPE <class 'float'>: [1919.0 2006.0 2008.0 2014.0 2017.0 2007.0 1997.0 2009.0 1911.0 1996.0 2015.0 1955.0 2016.0 1989.0 1930.0 1981.0 1999.0 1990.0 1915.0 2000.0 2013.0 2003.0 1912.0 1916.0 2018.0 2011.0 2001.0 1948.0 2010.0 1987.0 2012.0 2005.0 1977.0 1967.0 2002.0 1998.0 1921.0 1910.0 1937.0 1992.0 1936.0 1963.0 1988.0 1982.0 2004.0 1913.0 1966.0 1920.0 1968.0 1949.0 1957.0 1932.0 1959.0 1909.0 1993.0 1995.0 1985.0 1958.0 1939.0 1941.0 1908.0 1938.0 2020.0 1976.0 1984.0 1898.0 1974.0 1973.0 1946.0 1965.0 1970.0 1994.0 1918.0 1944.0 1952.0 1980.0 1942.0 1940.0 1926.0 1979.0 1978.0 1927.0 1947.0 1971.0 1956.0 1914.0 1905.0 1928.0 1972.0 1960.0 1934.0 1950.0 1896.0 1986.0 1962.0 1954.0 1901.0 1975.0 1991.0 1931.0 1943.0 1945.0 1983.0 1935.0 1929.0 1953.0 1917.0 1906.0 1951.0 1903.0 1969.0 1925.0 1924.0 2019.0 1964.0 1933.0 1922.0 1923.0 1904.0 1902.0 1907.0 1961.0 1900.0 1899.0 1897.0 1895.0 2021.0 1891.0 nan 1894.0]
UNIQUE VALUES PER TYPE <class 'str'>: ['1896' '1938' '1987' '2008' '2012' '2015' '1978' '2011' '2007' '2014' '2004' '1912' '1997' '2016' '2013' '1966' '1994' '1915' '1913' '2005' '2003' '1991' '1930' '1903' '1988' '1919' '2009' '1963' '2010' '1932' '1985' '1962' '1983' '1990' '1968' '1981' '1908' '1906' '1993' '2006' '1998' '1898' '1961' '1936' '1964' '1989' '1960' '1909' '1910' '1923' '1944' '1914' '2001' '1950' '1999' '1971' '1955' '1952' '2017' '1995' '1925' '1986' '1900' '1957' '1982' '1975' '1973' '1917' '1979' '2002' '1970' '1924' '1992' '1967' '1996' '2000' '1969' '1977' '1918' '1911' '1980' '1947' '1934' '1907' '1904' '1902' '1976' '1922' '1928' '1956' '1905' '1972' '1921' '1984' '1931' '1937' '1974' '1941' '1959' '1920' '2018' '1926' '1942' '1927' '1939' '1916' '1951' '1935' '1954' '1933' '1949' '1948' '1929' '2020' '1946' '1965' '1940' '1899' '2019' '1953' '1958' '1901' '1943' '1895' '1945' '1897' '2021' '1894' '2003–2005' '2007–2011' '2010–2013' '2012–2016']
----------------------------------------
imdbID (100.0% coverage)
TYPES IN COLUMN: [<class 'str'> <class 'float'>]
UNIQUE VALUES PER TYPE <class 'str'>: ['tt2268369' 'tt1560760' 'tt0016750' ... 'tt0125664' 'tt0177789' 'tt0134119']
UNIQUE VALUES PER TYPE <class 'float'>: [nan]
----------------------------------------
imdbRating (34.7% coverage)
TYPES IN COLUMN: [<class 'float'>]
UNIQUE VALUES PER TYPE <class 'float'>: [ nan 6.9 2.4 6.3 5.5 5. 7.6 6.8 7.2 7.4 7.1 5.4 5.9 5.8 5.1 4.8 3.7 5.3 7.3 7.9 7.8 7. 7.5 6.6 6.4 5.7 4. 6.1 6.5 3.1 6.2 5.6 3.5 6.7 8.6 3.6 4.1 8.5 6. 8.1 4.7 2.3 3.9 8. 7.7 8.9 3.2 8.2 9. 4.4 4.3 4.6 8.4 3.3 2.6 2.8 8.3 4.9 8.8 3.4 5.2 1.9 2.5 4.5 9.2 3.8 8.7 2.9 4.2 9.1 2.1 9.7 9.3 3. 2. 1.2 2.7 10. 1.8 9.4 9.8 1.4 1. 1.3 9.5 9.6 1.6 2.2 9.9 1.5 1.7 1.1]
----------------------------------------
imdbVotes (34.7% coverage)
TYPES IN COLUMN: [<class 'float'> <class 'str'>]
UNIQUE VALUES PER TYPE <class 'float'>: [nan]
UNIQUE VALUES PER TYPE <class 'str'>: ['35' '345' '11' ... '110,566' '132,448' '151,938']
----------------------------------------
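The dump shows several columns stored with mixed representations: Runtime mixes 'NNN min', 'H h MM min', comma-grouped values like '1,320 min', and a few rows with the unit truncated to 'mi'; Year mixes int, float, and strings, including ranges such as '2003–2005'. A minimal normalization sketch for these two columns (assumptions: the Movie_Movies frame is loaded as `movies`, and the helper names `runtime_to_minutes` / `year_to_float` are mine, not part of the dataset):

```python
import math
import re

import pandas as pd


def runtime_to_minutes(value) -> float:
    """Parse mixed Runtime strings ('97 min', '2 h 5 min', '1,320 min',
    '15 h 40 mi') into a float number of minutes; unparseable -> NaN."""
    if pd.isna(value):
        return math.nan
    text = str(value).replace(",", "")               # '1,320 min' -> '1320 min'
    hours = re.search(r"(\d+)\s*h", text)            # optional hour component
    mins = re.search(r"(\d+(?:\.\d+)?)\s*mi", text)  # 'min' is sometimes cut to 'mi'
    if not hours and not mins:
        return math.nan
    total = 60.0 * int(hours.group(1)) if hours else 0.0
    if mins:
        total += float(mins.group(1))
    return total


def year_to_float(value) -> float:
    """Coerce mixed Year values (int, float, str, ranges like '2003–2005')
    to the first four-digit year, as float so NaN stays representable."""
    match = re.search(r"\d{4}", str(value))
    return float(match.group(0)) if match else math.nan
```

Applied with `movies["Runtime"].map(runtime_to_minutes)` and `movies["Year"].map(year_to_float)`, both columns become purely numeric and ready for plotting; mapping per value rather than casting the whole column avoids tripping over the per-row type differences seen above.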